Introduction

The goal of this project is to use machine learning to predict the winner of UFC fights. We will use a large dataset from Kaggle to build and compare several models for this binary classification problem.

# Packages/Chunk Settings from Luke Fields' Project
library(corrplot)  # for the correlation plot
library(discrim)  # for linear discriminant analysis
library(corrr)   # for calculating correlation
library(knitr)   # to help with the knitting process
library(MASS)    # to assist with the markdown processes
library(tidyverse)   # using tidyverse and tidymodels for this project mostly
library(tidymodels)
library(ggplot2)   # for most of our visualizations
library(ggrepel)
library(ggimage)
library(rpart.plot)  # for visualizing trees
library(vip)         # for variable importance 
library(vembedr)     # for embedding links
library(janitor)     # for cleaning out our data
library(randomForest)   # for building our randomForest
library(stringr)    # for matching strings
library(dplyr)       # for basic R functions
library(yardstick)   # for measuring certain metrics
tidymodels_prefer()

knitr::opts_chunk$set(   # basic chunk settings
    fig.height = 5,
    fig.width = 7,
    tidy = TRUE,
    tidy.opts = list(width.cutoff = 60)
)
options(digits = 4)


indent1 = '    '        # basic indent settings
indent2 = '        '
indent3 = '            '

What is the UFC?

The Ultimate Fighting Championship (UFC) is a professional mixed martial arts organization. In the UFC, two opponents of the same sex and weight class fight. Each round is scheduled for 5 minutes, and each fight is scheduled for 3 or 5 rounds. However, not every fight goes on for as long as scheduled: the winner may finish the fight early by “submitting” or “knocking out” their opponent. The UFC has gained worldwide popularity, hosting events in arenas around the world.

Why is this model relevant?

Our goal is to take in career statistics of two fighters and predict who will win the fight. Being able to predict the winner of an upcoming fight has massive implications for betting, as well as bragging rights among your friends. More importantly, this project will help me better understand what factors truly make a good match up in a fight, and will hopefully uncover some hidden trends among winning UFC fighters.

Project Roadmap

Now that we know the background and importance of our model, let’s discuss how we are going to build it throughout this project. First, we will import our data set from Kaggle. There will need to be some initial data manipulation and cleaning, and then we will perform some exploratory data analysis to get a better understanding of our data set. Our goal is to use predictor variables to predict a binary response variable, “winner,” detailing whether the “blue” fighter or the “red” fighter will win a given match-up.

Each row in our data set has data on a single fight. Each fighter is classified (arbitrarily) as the red fighter or the blue fighter. Career statistics (average significant strikes landed, total wins, height, etc.) are then given for both the red fighter and the blue fighter.

After tidying, we will perform a training/test split on our data, make a recipe, and set folds for the 10-fold cross validation we will implement. Logistic Regression, Linear Discriminant Analysis, Decision Tree, Random Forest, and K-Nearest Neighbors models will all be used to model the training data when we finish the setup. Whichever model performs the strongest will then be fit to our testing data set so we can analyze how effective our model truly is.

Exploratory Data Analysis

Before diving into modeling, it’s essential to understand the structure and content of our data. This large dataset contains 95 variables, so we need to tidy it up for analysis. Tidying includes transforming categorical variables into factors and handling any missing values. In this section, we’ll manipulate and clean our data, followed by analyzing key variables using visualizations and various functions.

Loading and Exploring Raw Data

Let’s read in our data. Checking the dimensions of this dataset, we notice we’ve got 95 variables. Fortunately, many of these variables are unnecessary for achieving our goal.

ufc_original_dataset <- read_csv("ufc_dataset.csv")
dim(ufc_original_dataset)
## [1] 7439   95

Morphing Our Data

Although we will only be able to train and test our models on fights which have already occurred (as we need to know the real outcome), I would like to be able to use this model to predict the outcome of upcoming fights. For that purpose, I am dropping predictors from the dataset which are only known after the fight, such as the number of strikes exchanged or takedowns attempted in the fight. I simply want each fighter’s average career stats. I am focused on predictors which are known before the fight, such as a fighter’s average striking accuracy or average takedown attempts per minute. This leaves us to consider the following variables when creating our recipe.

  • weight_class: The weight class in which the fight took place
  • gender: Gender of fighters
  • r_wins_total: Number of total wins by Red fighter
  • b_wins_total: Number of total wins by Blue fighter
  • r_losses_total: Number of total losses by Red fighter
  • b_losses_total: Number of total losses by Blue fighter
  • r_age: Current age of Red fighter
  • b_age: Current age of Blue fighter
  • r_height: Height in cm of Red fighter
  • b_height: Height in cm of Blue fighter
  • r_weight: Weight in kg of Red fighter
  • b_weight: Weight in kg of Blue fighter
  • r_reach: Reach in cm of Red fighter
  • b_reach: Reach in cm of Blue fighter
  • r_stance: Stance of Red fighter
  • b_stance: Stance of Blue fighter
  • r_SLpM_total: Career Significant Strikes Landed per Minute by Red fighter
  • b_SLpM_total: Career Significant Strikes Landed per Minute by Blue fighter
  • r_SApM_total: Career Significant Strikes Absorbed per Minute by Red fighter
  • b_SApM_total: Career Significant Strikes Absorbed per Minute by Blue fighter
  • r_sig_str_acc_total: Career Significant Striking Accuracy by Red fighter
  • b_sig_str_acc_total: Career Significant Striking Accuracy by Blue fighter
  • r_td_acc_total: Career Takedown Accuracy by Red fighter
  • b_td_acc_total: Career Takedown Accuracy by Blue fighter
  • r_str_def_total: Career Significant Strike Defense (the % of opponent strikes that did not land) by Red fighter
  • b_str_def_total: Career Significant Strike Defense (the % of opponent strikes that did not land) by Blue fighter
  • r_td_def_total: Career Takedown Defense (the % of opponents TD attempts that did not land) by Red fighter
  • b_td_def_total: Career Takedown Defense (the % of opponents TD attempts that did not land) by Blue fighter
  • r_sub_avg: Career Average Submissions Attempted per 15 minutes by Red fighter
  • b_sub_avg: Career Average Submissions Attempted per 15 minutes by Blue fighter
  • r_td_avg: Career Average Takedowns Landed per 15 minutes by Red fighter
  • b_td_avg: Career Average Takedowns Landed per 15 minutes by Blue fighter
# Select Predictors Listed Above
ufc_trimmed_data <- ufc_original_dataset %>%
    select(winner, weight_class, gender, 
         r_wins_total, b_wins_total, 
         r_losses_total, b_losses_total,
         r_age, b_age, 
         r_height, b_height, 
         r_weight, b_weight, 
         r_reach, b_reach, 
         r_stance, b_stance,
         r_SLpM_total, b_SLpM_total, 
         r_SApM_total, b_SApM_total,
         r_sig_str_acc_total, b_sig_str_acc_total,
         r_td_acc_total, b_td_acc_total,
         r_str_def_total, b_str_def_total,
         r_td_def_total, b_td_def_total,
         r_sub_avg, b_sub_avg, 
         r_td_avg, b_td_avg)

By checking the dimensions below, we see that we’ve now got 33 variables: 32 predictors and a single outcome variable.

dim(ufc_trimmed_data)
## [1] 7439   33

This is still more predictors than necessary. We must remember that we are focused on match-ups, not individual fighters. Rather than listing each fighter’s average career stats separately, we can create new variables recording the difference between the two fighters’ career stats. This will provide more valuable information about the match-up itself, and will be easier to work with.

#Create new 'difference' variables, calculated as 'RED - BLUE'

matchup_data <- ufc_trimmed_data %>%
   mutate(
    career_wins_diff = r_wins_total - b_wins_total,
    career_losses_diff = r_losses_total - b_losses_total,
    height_diff = r_height - b_height,
    weight_diff = r_weight - b_weight,
    reach_diff = r_reach - b_reach,
    SLpM_diff = r_SLpM_total - b_SLpM_total,
    SApM_diff = r_SApM_total - b_SApM_total,
    sig_str_acc_diff = r_sig_str_acc_total - b_sig_str_acc_total,
    td_acc_diff = r_td_acc_total - b_td_acc_total,
    str_def_diff = r_str_def_total - b_str_def_total,
    td_def_diff = r_td_def_total - b_td_def_total,
    sub_avg_diff = r_sub_avg - b_sub_avg,
    td_avg_diff = r_td_avg - b_td_avg
  ) %>%
  select(winner, weight_class, gender, r_age, b_age, r_stance, b_stance, 
         career_wins_diff, career_losses_diff, height_diff, weight_diff, reach_diff, 
         SLpM_diff, SApM_diff, 
         sig_str_acc_diff, td_acc_diff,
         str_def_diff, td_def_diff,
         sub_avg_diff, td_avg_diff)

Now we have 19 predictor variables and a single outcome variable. Remember, we will be using our predictor variables to “predict” the value of our outcome variable, “winner”.

dim(matchup_data)
## [1] 7439   20

Tidying Our Data

At this point, there are two predictors I want to examine before moving forward with our EDA. These predictors are gender and weight_class. First, I notice that around 90% of my data involves fights between men.

# Get the number of fights per gender
fights_per_gender <- matchup_data %>%
  group_by(gender) %>%
  summarise(number_of_fights = n())

# Create a bar plot
ggplot(fights_per_gender, aes(x = gender, y = number_of_fights, fill = gender)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Men" = "goldenrod", "Women" = "darkgrey")) +
  labs(title = "Number of Fights per Gender", x = "Gender", y = "Number of Fights") 

Also, it is important to remember that every fight occurs between members of the same gender, so gender is unlikely to be a good predictor of the winner anyway. For these two reasons, we will drop the ‘gender’ predictor from our dataset. We will still be including the observations (fights) occurring between female fighters; we are simply dropping the “gender” variable from our models.

matchup_data <- matchup_data %>%
  select(-gender)

Now we have a shorter, more informative set of predictors for each fight:

  • weight_class: The weight class in which the fight took place
  • r_stance: Stance of Red fighter
  • b_stance: Stance of Blue fighter
  • r_age: Age of Red fighter
  • b_age: Age of Blue fighter
  • career_wins_diff: Difference in total number of wins between Red and Blue
  • career_losses_diff: Difference in total number of losses between Red and Blue
  • height_diff: Difference in height between Red and Blue (in cm)
  • weight_diff: Difference in weight between Red and Blue (in kg)
  • reach_diff: Difference in reach between Red and Blue (in cm)
  • SLpM_diff: Difference in Career Average Significant Strikes Landed per Minute between Red and Blue
  • SApM_diff: Difference in Career Average Significant Strikes Absorbed per Minute between Red and Blue
  • sig_str_acc_diff: Difference in Career Average Significant Striking Accuracy between Red and Blue
  • td_acc_diff: Difference in Career Average Takedown Accuracy between Red and Blue
  • str_def_diff: Difference in Career Average Significant Strike Defense (the % of opponent strikes that did not land) between Red and Blue
  • td_def_diff: Difference in Career Average Takedown Defense (the average % of opponent TD attempts that did not land) between Red and Blue
  • sub_avg_diff: Difference in Career Average Submissions Attempted per 15 minutes between Red and Blue
  • td_avg_diff: Difference in Career Average Takedowns Landed per 15 minutes between Red and Blue

For each ‘difference’ predictor (height_diff, SLpM_diff, etc.), a positive number means that the red fighter’s stat is larger, and a negative number means that the blue fighter’s stat is larger. A zero means that both fighters are equal in the given stat. For example, if height_diff equals 5, the red fighter is 5 centimeters taller than the blue fighter. If height_diff equals -5, the blue fighter is 5 centimeters taller than the red fighter.
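This sign convention can be illustrated in miniature with made-up numbers (a base-R sketch; the values and the two-row toy frame are hypothetical, not taken from the dataset):

```r
# Miniature of the 'RED - BLUE' convention on made-up heights (cm)
toy <- data.frame(r_height = c(180, 170), b_height = c(175, 178))
toy$height_diff <- toy$r_height - toy$b_height
toy$height_diff  # 5: red is 5 cm taller; -8: blue is 8 cm taller
```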

Next, we can examine the ‘weight_class’ variable.

# Get the unique weight classes
unique_weight_classes <- unique(matchup_data$weight_class)

# Get the total number of unique weight classes
total_unique_weight_classes <- length(unique_weight_classes)

# Print the total number of unique weight classes
print(total_unique_weight_classes)
## [1] 109

As we see above, there are 109 unique weight class values in our data set. In modern day MMA, there are only 12 weight classes: 8 for men, 4 for women. However, some of the fights in this data occurred before 2001, when the modern rule set was put in place. Our dataset has no ‘date’ variable; however, we can remove most of the pre-2001 fights by filtering out observations whose weight class includes a number (e.g., UFC 5 Tournament, Ultimate Ultimate ‘95 Tournament Title). That leaves us with only a few unique weight classes (Ultimate Japan Heavyweight Tournament Title, Super Heavyweights, UFC Superfight Championship, Open Weight), which a quick Google search shows were never used after 2001. The ‘Catch Weight’ value represents a wide variety of weights, and it is very uncommon in MMA, so we will remove observations with that weight class as well.

pattern <- "\\b(1[0-9]|2[0-9]|30|[1-9])\\b"

# Drop rows where weight_class contains numbers 1 to 30 (UFC Events 1-30)
matchup_data <- matchup_data[!grepl(pattern, matchup_data$weight_class), ]

# Drop rows where weight_class represents a pre-2001 fight, or a fight which occurred at a very rare weight class
drop_values <- c("Ultimate Japan Heavyweight Tournament Title", 
                 "Ultimate Ultimate '95 Tournament Title", 
                 "Ultimate Ultimate '96 Tournament Title",
                 "UFC Superfight Championship",
                 "Super Heavyweight",
                 "Open Weight",
                 "Catch Weight")

matchup_data <- matchup_data[!matchup_data$weight_class %in% drop_values, ]
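A quick check of what the word-boundary pattern above does and does not catch (base R, toy strings) also shows why the explicit drop list is still needed:

```r
pattern <- "\\b(1[0-9]|2[0-9]|30|[1-9])\\b"
grepl(pattern, "UFC 5 Tournament")                        # TRUE: standalone digit 5
grepl(pattern, "Lightweight")                             # FALSE: no digits at all
# FALSE: 95 is outside 1-30, so this title must be dropped by name instead
grepl(pattern, "Ultimate Ultimate '95 Tournament Title")
```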

As we can see below, these changes brought our data set down to 40 unique weight class values. This is better, but still not quite what we are looking for.

# Get the list of unique weight classes
unique_weight_classes <- unique(matchup_data$weight_class)

# Get the total number of unique weight classes
total_unique_weight_classes <- length(unique_weight_classes)
total_unique_weight_classes
## [1] 40

We still need to get that number down to 12. The extra 28 weight class values represent modern weight classes; they just need to be cleaned up to their base state. For example, the weight class “UFC Heavyweight Title” will become “Heavyweight.”

matchup_data <- matchup_data %>% 
  mutate(weight_class = case_when(
    weight_class == "UFC Bantamweight Title" ~ "Bantamweight",
    weight_class == "UFC Featherweight Title" ~ "Featherweight",
    weight_class == "UFC Flyweight Title" ~ "Flyweight",
    weight_class == "UFC Heavyweight Title" ~ "Heavyweight",
    weight_class == "UFC Interim Bantamweight Title" ~ "Bantamweight",
    weight_class == "UFC Interim Featherweight Title" ~ "Featherweight",
    weight_class == "UFC Interim Flyweight Title" ~ "Flyweight",
    weight_class == "UFC Interim Heavyweight Title" ~ "Heavyweight",
    weight_class == "UFC Interim Light Heavyweight Title" ~ "Light Heavyweight",
    weight_class == "UFC Interim Lightweight Title" ~ "Lightweight",
    weight_class == "UFC Interim Middleweight Title" ~ "Middleweight",
    weight_class == "UFC Interim Welterweight Title" ~ "Welterweight",
    weight_class == "UFC Light Heavyweight Title" ~ "Light Heavyweight",
    weight_class == "UFC Lightweight Title" ~ "Lightweight",
    weight_class == "UFC Middleweight Title" ~ "Middleweight",
    weight_class == "UFC Welterweight Title" ~ "Welterweight",
    weight_class == "UFC Women's Bantamweight Title" ~ "Women's Bantamweight",
    weight_class == "UFC Women's Featherweight Title" ~ "Women's Featherweight",
    weight_class == "UFC Women's Flyweight Title" ~ "Women's Flyweight",
    weight_class == "UFC Women's Strawweight Title" ~ "Women's Strawweight",
    weight_class == "TUF Nations Canada vs. Australia Middleweight Tournament Title" ~ "Middleweight",
    weight_class == "TUF Nations Canada vs. Australia Welterweight Tournament Title" ~ "Welterweight",
    weight_class == "Ultimate Fighter Australia vs. UK Lightweight Tournament Title" ~ "Lightweight",
    weight_class == "Ultimate Fighter Australia vs. UK Welterweight Tournament Title" ~ "Welterweight",
    weight_class == "Ultimate Fighter China Featherweight Tournament Title" ~ "Featherweight",
    weight_class == "Ultimate Fighter China Welterweight Tournament Title" ~ "Welterweight",
    weight_class == "Ultimate Fighter Latin America Bantamweight Tournament Title" ~ "Bantamweight",
    weight_class == "Ultimate Fighter Latin America Featherweight Tournament Title" ~ "Featherweight",
    .default = weight_class
  ))

unique_weight_classes <- matchup_data %>%
  reframe(weight_class)

table(unique_weight_classes)
## weight_class
##          Bantamweight         Featherweight             Flyweight 
##                   630                   708                   324 
##           Heavyweight     Light Heavyweight           Lightweight 
##                   673                   655                  1270 
##          Middleweight          Welterweight  Women's Bantamweight 
##                   976                  1239                   197 
## Women's Featherweight     Women's Flyweight   Women's Strawweight 
##                    28                   216                   289

As we can see below, we are now down to our 12 weight classes, and we only lost around 100 observations! This data will be much easier to work with. Before building our predictive models, let’s do some Exploratory Data Analysis to get a better idea of what our data is telling us.

# Get the total number of unique weight classes
total_unique_weight_classes <- length(unique(matchup_data$weight_class))
total_unique_weight_classes
## [1] 12
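For reference, much of the enumeration in the recode above could be expressed more compactly by stripping the decorations from the title-fight labels (a base-R sketch; the TUF/Ultimate Fighter tournament titles do not follow this pattern and would still need the explicit mapping):

```r
# Strip a leading "UFC " (optionally "UFC Interim ") and a trailing " Title"
clean_wc <- function(x) sub(" Title$", "", sub("^UFC (Interim )?", "", x))

clean_wc("UFC Interim Light Heavyweight Title")  # "Light Heavyweight"
clean_wc("UFC Women's Strawweight Title")        # "Women's Strawweight"
clean_wc("Heavyweight")                          # already clean: unchanged
```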

Visual EDA

Variable Correlation Plot

We can examine a correlation plot in order to find any highly correlated predictors.

matchup_data_numeric <- matchup_data %>%  # getting just the numeric data
  select_if(is.numeric)

matchup_variables_cor <- cor(matchup_data_numeric, use = "complete.obs")  # calculating the correlation between each variable with complete observations

matchup_variables_cor_plt <- corrplot(matchup_variables_cor, method = "square", type = "lower") 

First, height_diff and reach_diff look correlated, which makes sense, as taller individuals normally have longer arms and shorter individuals normally have shorter arms. Significant strikes landed per minute and significant strikes absorbed per minute are also noticeably positively correlated, which makes sense, because if a fighter lands more punches on average, they are probably also on the receiving end of more punches on average. career_losses_diff and career_wins_diff are also highly correlated, because if there is a large difference in the fighters’ total wins, there is probably also a large difference in their total losses.

Distribution of Winners

Looking at the distribution of winners, although “Red” and “Blue” are arbitrary titles given to each fighter in each fight, there is about a 33/66 split in the outcome variable “winner.” This split is close enough to 50/50 for us not to resample. The reason it is not closer to a 50/50 split, even though the titles are arbitrary, is possibly due to the way the data was entered, or simply due to random chance.

ggplot(matchup_data, aes(x = winner, fill = winner)) +
  geom_bar() +
  scale_fill_manual(values = c("Blue" = "blue", "Red" = "red")) +
  labs(title = "Number of Wins by Fighter",
       x = "Winner",
       y = "Count of Wins",
       fill = "Winner") +
  theme_minimal()
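The same split can be checked numerically with a frequency table. A sketch on a toy vector with the rough 66/33 mix described above (matchup_data itself is built earlier in the document):

```r
# Numeric companion to the bar chart: proportions of each outcome class
winner <- c(rep("Red", 66), rep("Blue", 34))
round(prop.table(table(winner)), 2)
```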

Distribution of Winner by Age

The plot below examines how often the younger or the older fighter wins. This graph shows us that the younger fighter wins more often; however, there are plenty of fights in which the older fighter wins as well.

database_for_winners_plot <- matchup_data %>%
  filter(!is.na(r_age) & !is.na(b_age)) %>%
  filter(r_age != b_age) %>%  # same-age fights are dropped, so every row is Younger or Older
  mutate(
    winner_age = case_when(
      winner == "Red" & r_age < b_age ~ "Younger",
      winner == "Red" & r_age > b_age ~ "Older",
      winner == "Blue" & b_age < r_age ~ "Younger",
      winner == "Blue" & b_age > r_age ~ "Older"
    )
  )


# Plot the comparison
ggplot(database_for_winners_plot, aes(x = winner_age)) +
  geom_bar(aes(fill = winner_age)) +
  labs(title = "Comparison of Wins by Younger vs. Older Fighters",
       x = "Winner Age Category",
       y = "Count of Wins",
       fill = "Winner Age Category") +
  scale_fill_manual(values = c("Younger" = "goldenrod", "Older" = "darkgray")) +
  theme_minimal()

Height Difference Distribution

The distribution of height difference between fighters is, as expected, roughly normal and centered at zero. Most fights occur between fighters of similar heights.

ggplot(matchup_data, aes(x = height_diff)) +
  geom_histogram(binwidth = 1, fill = "goldenrod", color = "darkgrey") +
  labs(title = "Distribution of Height Difference", x = "Height Difference", y = "Frequency") +
  theme_minimal()

Difference in Average Significant Strikes Landed

This box plot examines the distribution of the “winner” variable based on career average significant strikes landed (how often they significantly punch or kick their opponent per minute, on average). Remember, a positive value reflects that the red fighter lands more significant strikes on average. A negative value reflects that the blue fighter lands more significant strikes on average.

This box plot shows that the fighter with the higher average number of significant strikes wins the fight more often.

ggplot(matchup_data, aes(x = SLpM_diff)) + 
  geom_boxplot(aes(fill = winner)) +
  scale_fill_manual(values = c("Red" = "red", "Blue" = "blue")) +
  labs(title = "Difference in Average Significant Strikes Landed by Winner",
       x = "Difference in Average Significant Strikes Landed",
       fill = "Winner")

Setting Up Models

Now we have imported, cleaned, and morphed our data. Exploratory Data Analysis has allowed us to explore some trends and confirm that our data set is ready to go. It’s finally time to start setting up our models! We will perform our train/test split, create our recipe, and establish cross validation to help with our models.

Train/Test Split

Before we do any model building, we have to perform a training/testing split on our data. I decided to go with an 80/20 split because the testing data set will still have a significant number of observations while our model has more data to train on. The reason we do this is to avoid over-fitting (when our model predicts our training data well, but performs poorly on previously unseen data). So, we will use the training set to train our models. We will then choose the best one, and use the testing set to judge how accurate our model is at predicting outcomes of new data. We stratify on our response variable, winner. This ensures that our training set and our testing set will have the same proportion of outcomes (Winner = Blue vs. Winner = Red) as the original data set.

set.seed(4)  # setting a seed so the split is the same
matchup_data <- matchup_data %>% 
  mutate(winner = factor(winner),
         weight_class = factor(weight_class),
         r_stance = factor(r_stance),
         b_stance = factor(b_stance))
matchup_split <- matchup_data %>%
  initial_split(prop = 0.8, strata = "winner")

matchup_train <- training(matchup_split) # training split
matchup_test <- testing(matchup_split) # testing split

We’ve now split up our data. Let’s check the dimensions of each set to make sure that we have plenty of observations in the training and testing sets. There are 5763 observations in our training set…

dim(matchup_train)
## [1] 5763   19

and 1442 observations in our testing set.

dim(matchup_test)
## [1] 1442   19
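What stratification buys us can be seen in a small base-R sketch (toy data, fixed seed; the 66/34 mix is hypothetical): sampling within each class separately keeps the Red/Blue proportions identical in both pieces.

```r
set.seed(4)
winner <- rep(c("Red", "Blue"), times = c(660, 340))  # toy outcome vector
# sample 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(winner), winner),
                           function(i) sample(i, size = 0.8 * length(i))))
round(prop.table(table(winner[train_idx])), 2)   # mix in the 80% training piece
round(prop.table(table(winner[-train_idx])), 2)  # held-out 20% keeps the same mix
```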

Recipe Building

Now, we can create the recipe we will fit our models to. In machine learning, a recipe is a reusable description of our dataset which relates each of the predictors to the outcome variable. Rather than reading in raw data, our predictive models will read in this recipe. First, we need to fill in any missing values. There are missing values for reach_diff, r_age, b_age, and stance. We estimate any missing values for reach difference and age using relevant features like weight class and height difference. We then fill in missing stance values with the most commonly occurring stance. Next, we handle any new, unseen categories (unique values of categorical variables, such as stance) by creating a new level, and convert all categorical variables into dummy variables. We remove predictors with near-zero variance and normalize all predictors to ensure they’re on a similar scale. Finally, we apply this recipe to our training data, making it clean and ready for the model.

matchup_recipe <-
# Recipe for the 'winner' variable using our training data
  recipe(winner ~ ., data = matchup_train) %>% 
# Estimate any missing values of reach difference using that row's weight class and height difference values
  step_impute_linear(reach_diff,  impute_with = imp_vars(weight_class, height_diff)) %>%
# Estimate any missing values of age using that row's weight class value
  step_impute_linear(r_age, impute_with = imp_vars(weight_class)) %>%
  step_impute_linear(b_age, impute_with = imp_vars(weight_class)) %>%
# Replace any missing stance values with the most commonly occurring value of stance
  step_impute_mode(r_stance, b_stance) %>%
# Create a new level for any unseen factor levels in the nominal predictors
  step_novel(all_nominal_predictors()) %>% 
# Convert all nominal predictors to dummy variables
  step_dummy(all_nominal_predictors()) %>% 
# Remove predictors with near-zero variance
  step_nzv(all_predictors()) %>% 
# Normalize all predictors
  step_normalize(all_predictors())
  
# Prepare the recipe and apply it to the training data
prep(matchup_recipe) %>% bake(matchup_train)
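The linear imputation step used above can be sketched in base-R miniature (hypothetical numbers; the column names merely mirror ours): fit a linear model on the complete rows, then predict the missing entry from the related feature.

```r
# Toy frame: one reach_diff value is missing
df <- data.frame(height_diff = c(10, 5, 0, -5, -10),
                 reach_diff  = c(12, 6, NA, -4, -11))
fit <- lm(reach_diff ~ height_diff, data = df)   # lm drops the NA row by default
miss <- is.na(df$reach_diff)
df$reach_diff[miss] <- predict(fit, df[miss, ])  # fill the gap with the fitted value
df$reach_diff
```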

K-Fold Cross Validation

We are going to use stratified cross validation to help with the issue of imbalanced data. This step splits our training data into 10 subsets, called ‘folds’. This allows us to train and validate our model on different portions of the data. We will stratify on our response variable, winner, ensuring that each fold has a similar distribution of winners. This prevents overfitting, similar to how our original train/test split prevented overfitting.

matchup_folds <- vfold_cv(matchup_train, v = 10, strata = winner)  # 10-fold CV
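In miniature, 10-fold CV just partitions the row indices so that every row is held out for validation exactly once (a base-R sketch on 100 toy rows, without the stratification):

```r
set.seed(4)
fold_id <- sample(rep(1:10, length.out = 100))  # assign each row to one fold
table(fold_id)  # ten folds of ten rows each; each row is validated exactly once
```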

Model Building

Time for the most important part of the project: building our models. As previously stated in the introduction, we will be trying out five different machine learning techniques, all using the same recipe. I decided to use ROC AUC as the success metric because it is a robust measure of performance for a binary classification model, especially when the data is not perfectly balanced. The ROC AUC (Receiver Operating Characteristic Area Under the Curve) measures the area under the ROC curve, which plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various threshold settings. An ROC AUC value ranges from 0 to 1, with 0.5 being as accurate as random choice. A higher ROC AUC value indicates a better model, as it signifies a good balance between sensitivity (true positive rate) and specificity (true negative rate), effectively capturing the model’s ability to distinguish between the two classes.
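As a tiny worked example of what that number means (base R, made-up scores): the AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case.

```r
truth  <- c(1, 1, 1, 0, 0, 0)              # 1 = red fighter actually won, say
scores <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)  # model's predicted P(red wins)
pos <- scores[truth == 1]
neg <- scores[truth == 0]
# fraction of (positive, negative) pairs ranked correctly, ties counted as half
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 8 of the 9 pairs are ranked correctly
```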

Nearly every model built had the same process, which I will detail right now.

    1. Set up the model by specifying what type of model, setting its engine, and setting its mode (which was always classification).
    2. Set up the workflow, add the new model, and add the established recipe.

Skip steps 3-5 for Logistic Regression and LDA, which have no parameters to tune.

    3. Set up the tuning grid with the parameters that we want tuned, and how many different levels of tuning.
    4. Tune the model with certain parameters of choice.
    5. Select the most accurate model from all of the tuning, and finalize the workflow with those tuning parameters.
    6. Fit that model with our workflow to the training data set.
    7. Save our results to an RDA file so we do not have to spend time running it over and over again.
# Logistic Regression Model

log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

matchup_log_wkflow <- workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(matchup_recipe)

final_log_fit <- fit_resamples(matchup_log_wkflow, matchup_folds)

save(final_log_fit, file = "/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_log_fit.rda")
# Linear Discriminant Analysis Model

lda_mod <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS")

matchup_lda_wkflow <- workflow() %>% 
  add_model(lda_mod) %>% 
  add_recipe(matchup_recipe)

final_lda_fit <- fit_resamples(matchup_lda_wkflow, matchup_folds)

save(final_lda_fit, file = "/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_lda_fit.rda")
# Decision Tree Model 

matchup_tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>% 
  set_mode("classification")

matchup_tree_wf <- workflow() %>% 
  add_model(matchup_tree_spec) %>% 
  add_recipe(matchup_recipe)

param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

tune_tree <- tune_grid(
  matchup_tree_wf, 
  resamples = matchup_folds, 
  grid = param_grid
)

best_complexity <- select_best(tune_tree, metric = 'roc_auc')

matchup_tree_final <- finalize_workflow(matchup_tree_wf, best_complexity)

final_tree_fit <- fit_resamples(matchup_tree_final, matchup_folds)
 
save(final_tree_fit, tune_tree, file = "/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_tree_fit.rda")
# Random Forest Model 

rf_matchup_spec <- rand_forest(mtry = tune(), 
                           trees = tune(), 
                           min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

rf_matchup_wf <- workflow() %>% 
  add_model(rf_matchup_spec) %>% 
  add_recipe(matchup_recipe)

rf_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 5)

tune_rf <- tune_grid(
  rf_matchup_wf,
  resamples = matchup_folds,
  grid = rf_grid
)

#best rf (hyper-parameters) across all folds
best_matchup_rf <- select_best(tune_rf, metric='roc_auc')

#taking that rf, fitting it over folds again (redundant)
final_rf_model <- finalize_workflow(rf_matchup_wf, best_matchup_rf)

final_rf_fit <- fit_resamples(final_rf_model, matchup_folds)

save(best_matchup_rf, final_rf_fit, final_rf_model, tune_rf, file = "/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_rf_fit.rda")
# K-Nearest Neighbor Model

knn_matchup_model <- nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_matchup_wflow <- workflow() %>% 
  add_model(knn_matchup_model) %>% 
  add_recipe(matchup_recipe)

neighbors_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)

tune_knn <- tune_grid(
  object = knn_matchup_wflow, 
  resamples = matchup_folds, 
  grid = neighbors_grid,
  control = control_grid()
)


best_neighbors <- select_best(tune_knn, metric='roc_auc')

final_matchup_wf <- finalize_workflow(knn_matchup_wflow, best_neighbors)

final_knn_fit <- fit_resamples(final_matchup_wf, matchup_folds)

save(final_knn_fit, tune_knn, file = "/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_knn_fit.rda")

Accuracy of Our Models

Finally, we have created five unique predictive model types, fit our data to the models, and selected the most successful model of each type. Let’s load up our models and take a look at our results!

#Folds, recipe, training set, testing set
load("/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/matchup_components.rda")
#Logistic Regression Model
load("/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_log_fit.rda")
#LDA Model
load("/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_lda_fit.rda")
#Decision Tree Model
load("/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_tree_fit.rda")
#Random Forest Model
load("/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_rf_fit.rda")
#KNN Model
load("/Users/bennettbishop/Downloads/UFC-Final-Project/Prep-RDA/final_knn_fit.rda")
log_reg_auc <- collect_metrics(final_log_fit)  %>%
  filter(.metric == "roc_auc") 
lda_auc <- collect_metrics(final_lda_fit)  %>%
  filter(.metric == "roc_auc") 
tree_auc <- collect_metrics(final_tree_fit)  %>%
  filter(.metric == "roc_auc") 
rf_auc <- collect_metrics(final_rf_fit)  %>%
  filter(.metric == "roc_auc") 
knn_auc <- collect_metrics(final_knn_fit)  %>%
  filter(.metric == "roc_auc") 


matchup_model_names = c("Logistic Regression", "LDA", "Decision Tree", "Random Forest", "K-Nearest Neighbor")
matchup_ROC_AUCs = c(log_reg_auc$mean, lda_auc$mean, tree_auc$mean, rf_auc$mean, knn_auc$mean)

# Print the results
matchup_results <- tibble(Model = matchup_model_names,
                             ROC_AUC = matchup_ROC_AUCs)

matchup_results <- matchup_results %>% 
  dplyr::arrange(desc(ROC_AUC))

matchup_results
## # A tibble: 5 × 2
##   Model               ROC_AUC
##   <chr>                 <dbl>
## 1 Random Forest         0.782
## 2 Logistic Regression   0.773
## 3 LDA                   0.771
## 4 Decision Tree         0.701
## 5 K-Nearest Neighbor    0.659

To help visualize these results, I have included a bar graph below. Remember, we want the model with the highest ROC AUC:

matchup_bar_plot <- ggplot(matchup_results, 
       aes(x = Model, y = ROC_AUC)) + 
  geom_bar(stat = "identity", width=0.2, fill = "goldenrod", color = "darkgray") + 
  labs(title = "Performance of Our Models") + 
  theme_minimal()

matchup_lollipop_plot <- ggplot(matchup_results, aes(x = Model, y = ROC_AUC)) + 
  geom_segment(aes(x = Model, xend = Model, y = 0, yend = ROC_AUC)) +
  geom_point(color = "darkgray", fill = alpha("goldenrod", 0.3), alpha = 0.7, shape = 21, stroke = 3) +
  labs(title = "Performance of Our Models") + 
  theme_minimal()

matchup_dot_plot <- ggplot(matchup_results, aes(x = Model, y = ROC_AUC)) +
  geom_point(fill = "darkgray", col = "goldenrod") + 
  geom_segment(aes(x = Model, 
                   xend = Model, 
                   y=min(ROC_AUC), 
                   yend = max(ROC_AUC)), 
               linetype = "dashed", 
               linewidth = 0.5) + 
  labs(title = "Performance of Our Models") + 
  theme_minimal() +
  coord_flip()
matchup_bar_plot

The graph above shows that the best random forest model we built is the top-performing model overall. This is the model we will fit to our testing data going forward, and the one whose true performance we will analyze.

Results from Best Model

Congratulations, Random Forest #1!

It looks like our very best model is Random Forest #1!

Random forests work by building many decision trees at training time and outputting the modal class of the individual trees (for classification). Each tree is built from a random subset of the training data and features, which helps reduce overfitting and improve generalization. By aggregating the votes of many trees into a final prediction, random forests provide more accurate and robust predictions than a single decision tree. The random forest has 3 hyperparameters:

  • mtry: The number of predictors that are randomly sampled as split candidates when building each tree.

  • trees: This represents the number of trees in each forest model.

  • min_n: The minimum number of data points in a node required for the node to be split further.

It seems that the best of our 125 candidate Random Forest models (a 5 × 5 × 5 grid over the three hyperparameters) is the top performer out of all five techniques we implemented across the board. We can see its specific hyperparameters below. Our best forest contains 500 trees!

print(best_matchup_rf)
## # A tibble: 1 × 4
##    mtry trees min_n .config               
##   <int> <int> <int> <chr>                 
## 1     4   500    15 Preprocessor1_Model069

This model has an ROC AUC of 0.7822 across the cross-validation folds.

final_fit_metrics <- collect_metrics(final_rf_fit) %>% 
  filter(.metric == "roc_auc")
final_fit_metrics
## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.782    10 0.00772 Preprocessor1_Model1

Itttttssssssssss time! Now, we can go ahead and introduce our model to the never-before-seen testing data set and see how well our model does with new data!

knitr::include_graphics("its-time-gif.gif")

#Fit Best Model over all Training Data
final_rf_fit2 <- final_rf_model %>% fit(matchup_train)

#Use Model to Predict outcome of Testing Data
matchup_predict <- predict(final_rf_fit2, 
                             new_data = matchup_test, 
                            type = "class")

And just like that, we have exposed our best model to our testing data! Below, you can see the predicted value of winner next to the actual value of winner. This just happens to be the first ten rows (out of 1442), but it’s pretty neat to see that it predicted 7 of those 10 rows’ winners correctly!

# Add Actual/Predicted Values Side By Side For Our Viewing
matchup_predict_with_actual <- matchup_predict %>%
 bind_cols(matchup_test) 

matchup_predict_with_actual
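To put a number on that beyond eyeballing the first ten rows, here is a quick sketch using yardstick’s `accuracy()`. It assumes the `matchup_predict_with_actual` object built above, with the predicted class in `.pred_class` and the true label in `winner`:

```r
# Sketch: overall classification accuracy on the testing set.
matchup_predict_with_actual %>%
  mutate(winner = as.factor(winner)) %>%    # accuracy() needs a factor truth column
  accuracy(truth = winner, estimate = .pred_class)
```

Accuracy is a coarser metric than ROC AUC (it ignores the predicted probabilities), but it answers the simpler question of how often the model picks the right corner.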

Let’s take a more in-depth look at our results.

ROC Curve

Let’s take a look at the model’s ROC curve before we view the final ROC AUC! The graph looks good, as we want the curve to hug the upper-left corner as closely as possible. One way to think about it: the more our curve looks like the capital Greek letter gamma, Γ, the better.

# Ensure the 'winner' column is a factor
matchup_test <- matchup_test %>%
  mutate(winner = as.factor(winner))

# Compute the ROC curve
matchup_roc_curve <- augment(final_rf_fit2, new_data = matchup_test) %>%
  roc_curve(truth = winner, .pred_Blue)

# Plot the ROC curve
autoplot(matchup_roc_curve)

Our curve looks good!

Final ROC AUC

Let’s see the ROC AUC for our random forest on the testing data.

matchup_roc_auc <- augment(final_rf_fit2, new_data = matchup_test) %>%
  roc_auc(winner, .pred_Blue) %>%
  select(.estimate)  # computing the AUC for the ROC curve

matchup_roc_auc
## # A tibble: 1 × 1
##   .estimate
##       <dbl>
## 1     0.791

Our model has an ROC AUC of 0.791 on the testing data! In the world of statistics, an ROC AUC of around 0.8 is considered good in terms of measuring a model’s performance. Let’s celebrate!
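For another angle on test-set performance, a confusion matrix shows where the model’s class predictions land relative to the true winners. This is a sketch using yardstick’s `conf_mat()` on the same augmented predictions:

```r
# Sketch: confusion matrix of predicted vs. actual winners on the test set.
augment(final_rf_fit2, new_data = matchup_test) %>%
  mutate(winner = as.factor(winner)) %>%
  conf_mat(truth = winner, estimate = .pred_class) %>%
  autoplot(type = "heatmap")   # heatmap view of the 2x2 table
```

The off-diagonal cells reveal whether the model’s mistakes lean toward one corner, which a single ROC AUC number can hide.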

Visualizing Model Performance

Let’s explore the capabilities of our model!

Model Success By Weight Class

#Edited from Luke Fields' Project

weight_classes <- c("Heavyweight", "Light Heavyweight","Middleweight", 
                     "Welterweight", "Lightweight", "Featherweight",
                     "Bantamweight","Flyweight", "Women's Featherweight", 
                    "Women's Bantamweight", "Women's Flyweight", "Women's Strawweight")

augmented_matchup <- augment(final_rf_fit2, new_data = matchup_test)

weight_class_roc_auc <- function(weight_class){
  estimate_column <- (augmented_matchup %>%
                        filter(weight_class == !!weight_class) %>%
                        roc_auc(truth = winner, .pred_Blue) %>%
                        select(.estimate))
  estimate <- estimate_column$.estimate
  return(estimate)
}

weight_class_roc_auc_scores <- vector("numeric", length(weight_classes))

for(i in seq_along(weight_classes)){
  weight_class_roc_auc_scores[i] <- weight_class_roc_auc(weight_classes[i])
}

matchup_roc_auc_by_weight_class <- tibble('Weight Class' = weight_classes, ROC_AUC = weight_class_roc_auc_scores)

print(matchup_roc_auc_by_weight_class)
## # A tibble: 12 × 2
##    `Weight Class`        ROC_AUC
##    <chr>                   <dbl>
##  1 Heavyweight             0.767
##  2 Light Heavyweight       0.854
##  3 Middleweight            0.774
##  4 Welterweight            0.783
##  5 Lightweight             0.808
##  6 Featherweight           0.729
##  7 Bantamweight            0.788
##  8 Flyweight               0.802
##  9 Women's Featherweight   0.75 
## 10 Women's Bantamweight    0.776
## 11 Women's Flyweight       0.767
## 12 Women's Strawweight     0.865

Let’s take a look at this distribution.

ggplot(matchup_roc_auc_by_weight_class, aes(x = ROC_AUC, y = reorder(`Weight Class`, ROC_AUC))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "ROC AUC by Weight Class", x = "ROC AUC", y = "Weight Class") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Pretty cool! It looks like our model was best at predicting Women’s Strawweight (115 lb or under) and Light Heavyweight (205 lb). This is interesting, as these two classes sit at nearly opposite ends of the weight spectrum. I originally suspected that these weight classes might have an unusually large or small number of rows, but as the table below shows, they do not.

weight_class_counts <- matchup_test %>%
  group_by(weight_class) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

weight_class_counts
## # A tibble: 12 × 2
##    weight_class          count
##    <chr>                 <int>
##  1 Lightweight             246
##  2 Welterweight            241
##  3 Middleweight            216
##  4 Bantamweight            136
##  5 Heavyweight             134
##  6 Featherweight           133
##  7 Light Heavyweight       114
##  8 Flyweight                59
##  9 Women's Strawweight      59
## 10 Women's Bantamweight     53
## 11 Women's Flyweight        45
## 12 Women's Featherweight     6

Variable Importance Chart

Let’s take a look at a variable importance chart.

final_rf_fit2 %>%
  extract_fit_parsnip() %>% 
  vip(aesthetics = list(fill = "goldenrod", color = "gray")) 

As we can see above, the two most important variables in predicting the winner of the fight were SLpM_diff, the difference in the fighters’ significant strikes landed per minute, and career_wins_diff, the difference in the fighters’ number of career wins. It makes sense to me that significant strikes landed is such an important statistic. Landing more significant strikes than the other fighter helps in every case: you are more likely to get a KO, and even if you do not and the fight goes to decision, you will be more likely to win on the judges’ scorecards. It also makes sense to me that career_wins_diff is an important factor. A large gap in career wins means either that one fighter has significantly more experience than the other, or that they have simply won more of the fights they have been offered. I was, however, surprised that career_losses_diff falls so low on the list of importance. As we saw earlier in our correlation matrix, career_losses_diff is inversely related to career_wins_diff.
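If we want the importance scores as numbers rather than reading them off the chart, vip also exposes them as a tibble via `vi()`. A sketch, reusing the fitted workflow from above:

```r
# Sketch: pull the raw impurity-based importance scores as a tibble,
# so variables can be ranked and compared numerically.
final_rf_fit2 %>%
  extract_fit_parsnip() %>%
  vi() %>%
  arrange(desc(Importance))
```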

Conclusion

Well, we’re done for now! We gathered and tidied our data, we made a recipe, we fit that recipe to five different model types, we found the very best model, and we were able to fit that model to the testing data with an ROC AUC of about 79%! Working on this project vastly expanded my knowledge of machine learning, and I found myself growing more and more excited about the project as it progressed. This was a lot of fun, and I’ll definitely use this model to predict some upcoming fights.

There are a number of goals for this project which I have not yet achieved. First of all, I want to spend more time searching through our results to see if our model correctly predicted some of my favorite fights. For example, in Nate Diaz vs Conor McGregor 1, Diaz famously upset McGregor in the second round, submitting him via guillotine choke. I know that fight is in our data set, but I need to find a way to locate it, possibly by adding the variables “date” or “r_name” / “b_name” back to our recipe. Those variables are in the original dataset, but I did not select them for this project as they were not predictive. Adding them back would let me search for specific fights of interest and check how our model performed on them. I also want to predict not only ‘winner’, but also the way the fight was won (decision, knockout, submission). This information is in the data set, but again, predicting it was out of the scope of this project. Being able to check our model’s results for my favorite fights, and to predict the way the fight was won, would make this project even more fulfilling.
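As a hypothetical sketch of that first goal: once “date”, “r_name”, and “b_name” are carried through alongside the predictions, locating a specific fight would be a simple filter. The data frame name (`ufc_raw`) and the exact name strings here are assumptions about the original Kaggle dataset, not code from this project:

```r
# Hypothetical sketch: locate Diaz vs. McGregor 1 in the raw data by fighter
# names, checking both corner assignments.
ufc_raw %>%
  filter((r_name == "Conor McGregor" & b_name == "Nate Diaz") |
         (r_name == "Nate Diaz" & b_name == "Conor McGregor"))
```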

Thank you for checking out my UFC fight prediction project!